Abstract

This paper addresses the estimation of parameters of a Bayesian net- 
work from incomplete data. The task is usually tackled by running the 
Expectation-Maximization (EM) algorithm several times in order to ob- 
tain a high log-likelihood estimate. We argue that choosing the maximum 
log-likelihood estimate (as well as the maximum penalized log-likelihood 
and the maximum a posteriori estimate) has severe drawbacks, being af- 
fected both by overfitting and model uncertainty. Two ideas are discussed 
to overcome these issues: a maximum entropy approach and a Bayesian 
model averaging approach. Both ideas can be easily applied on top of 
EM, while the entropy idea can be also implemented in a more sophis- 
ticated way, through a dedicated non-linear solver. A vast set of exper- 
iments shows that these ideas produce significantly better estimates and 
inferences than the traditional and widely used maximum (penalized) log- 
likelihood and maximum a posteriori estimates. In particular, if EM is 
adopted as optimization engine, the model averaging approach is the best 
performing one; its performance is matched by the entropy approach when 
implemented using the non-linear solver. The results suggest that the ap- 
plicability of these ideas is immediate (they are easy to implement and to 
integrate in currently available inference engines) and that they constitute 
a better way to learn Bayesian network parameters. 



1 Introduction 

This paper focuses on learning the parameters of a Bayesian network (BN) 
with known structure from incomplete samples, under the assumption of MAR 
(missing-at-random) missing data. In this setting, the missing data make the 
log-likelihood (LL) function non-concave and multimodal. The most common 
approach to maximize LL in the presence of missing data is the Expectation- 
Maximization (EM) algorithm [4], which generally converges to a local maxi- 
mum of the LL function. The EM can be easily modified to maximize, rather 
than LL, the posterior probability of the parameters (MAP), as well as other
penalized maximum likelihood ideas [11, Sec. 1.6]. Generally, maximizing MAP
rather than LL yields smoother estimates, less prone to overfitting [5]. In the
following, we refer to the function to be maximized as the score. Although we
focus on BN learning, the ideas are general and should also apply to other
probabilistic graphical models that share similar characteristics in terms of
parameter learning.

In order to reduce the chance of obtaining an estimate with low score, a 
multi-start approach is adopted: EM is started from many different initializa- 
tion points, eventually selecting the estimate corresponding to the run that 
achieves the highest score. We argue however that this strategy has some draw- 
backs. Firstly, the estimate that maximizes the score can well do so because 
of overfitting. Even if the MAP estimation is adopted, the fixed structure and 
the amount of data may lead to overfitting, because they might not fully repre- 
sent the distribution that generated the data. Secondly, the estimates produced 
by the different EM runs are typically very different from each other, and yet
achieve very close scores [7, Chap. 19]. Choosing the single estimate with
highest score ignores model uncertainty, because the estimates with slightly
worse score are completely discarded. Overall, the score alone does not seem
to be powerful enough to identify the best estimate. Note that the challenge 
presented here does not involve model complexity: all the competing estimates 
have the same underlying structure and thus approaches such as the Bayesian 
Information Criterion (BIC) do not apply. 

We propose and compare two approaches to replace the criterion of selecting 
the highest score estimate, both based on well-known ideas already applied in 
other contexts. The first is based on the principle of maximum entropy and 
the second on model averaging. The maximum entropy criterion can be stated 
as: "when we make inferences on incomplete information, we should draw them
from that probability distribution that has the maximum entropy permitted by the
information which we do have" [6]. We interpret this principle by first discarding
estimates with low score, which we assume to be poor, and then by selecting the 
most entropic estimate among the remaining ones. The entropy-based criterion 
is expected to yield parameter estimates that are more robust to overfitting than 
those from the criterion of maximum score. In its simplest version, we apply the 
entropy principle on top of a multi-start EM; in a more sophisticated fashion, we 
implement it using a non-linear solver. In [12], a similar idea has been explored
to fit continuous distributions from only partial knowledge about moments or 
other features that are extracted from data. The model averaging idea is instead 
inspired by Bayesian Model Averaging (BMA, see [5]), and is designed to be used 
on top of the multi-start EM. BMA is a technique specifically designed to deal
with model uncertainty: it averages the predictions produced by a set of
competing models, assigning to each model a weight proportional to its
posterior probability. In the literature, BMA has often been used to manage
ensembles of BN classifiers; yet, most attempts were not successful, as reviewed
by [2]. The problem is that a single model usually becomes much more probable
than any competitor, and thus there is little difference between using BMA or
the single most probable model (that is, the maximum score one). Our approach
is to apply BMA locally to each conditional probability distribution that defines 
the BN; this allows us to instantiate a single BN model, whose parameters can 
be readily inspected, instead of dealing with a collection of models. 

Sections 3 and 4 present a large number of experiments with different BNs and
missingness processes. These experiments, performed on a grid of computers, 
would have taken more than nine months if run on a single desktop computer. 
They consistently show that both the entropy and the BMA-based approaches 
provide better estimates than the maximum score estimation, and that these 
better parameters also result in better inferences with the resulting BNs. 

2 Methods 

We adopt Bayesian networks as framework for our study, even if the discussion
of this paper may be relevant to the parameter learning of other probabilistic
graphical models. Therefore, we assume that the reader is familiar with basic
concepts of Bayesian networks. A Bayesian network (BN) is a triple
$(\mathcal{G}, X, \mathcal{P})$, where $\mathcal{G}$ is a directed acyclic graph
with nodes associated to random variables $X = \{X_1, \ldots, X_n\}$ over
discrete domains $\{\Omega_{X_1}, \ldots, \Omega_{X_n}\}$ and $\mathcal{P}$ is a
collection of probability values $p(x_j|\pi_j)$ with
$\sum_{x_j \in \Omega_{X_j}} p(x_j|\pi_j) = 1$, where $x_j \in \Omega_{X_j}$ is
a category of $X_j$ and $\pi_j \in \times_{X \in \Pi_j} \Omega_X$ is an
instantiation for the parents $\Pi_j$ of $X_j$ in $\mathcal{G}$. Furthermore,
every variable is conditionally independent of its non-descendants given its
parents. Given its independence assumptions, the joint probability distribution
represented by a BN is obtained by $p(x) = \prod_j p(x_j|\pi_j)$, where
$x \in \Omega_X$ and all $x_j, \pi_j$ (for every $j$) agree with $x$. Uppercase
letters are used for random variables and lowercase letters for their
corresponding categories. The graph $\mathcal{G}$ and the variables $X$ (and
their domains) are assumed to be known; $\theta_{v|w}$ is used to denote the
probability $p(v|w)$ (with $V, W \subseteq X$). Given the data
$y = (y^1, \ldots, y^N)$ with $N$ instances such that $y^i \in \Omega_{Y^i}$ and
$Y^i \subseteq X$ is the set of observed variables of instance $i$, we denote by
$N_w$ the number of instances of $y$ that agree with the configuration $w$. The
goal is to learn $\mathcal{P}$, which is usually done by maximizing the
penalized (log-)likelihood (LL) of $y$:

$$\hat\theta = \operatorname{argmax}_\theta S_\theta(y) = \operatorname{argmax}_\theta \left( \sum_{i=1}^{N} \log \theta_{y^i} + \alpha(\theta) \right),$$

where $\alpha$ is the penalty term. The argument $y$ of $S_\theta$ is omitted
from now on ($S$ stands for score). We use the penalized LL because one can
simply set the penalty to zero to obtain the standard likelihood, or to
$\log p(\theta)$ to get the MAP estimation. For ease of exposition, we assume:

$$\alpha(\theta) = \log \prod_{j=1}^{n} \prod_{\pi_j} \prod_{x_j} \theta_{x_j|\pi_j}^{\alpha_{x_j,\pi_j}},$$

with $\alpha_{x_j,\pi_j} = \frac{1}{|\Omega_{X_j}| \cdot |\Omega_{\Pi_j}|}$,
which is the MAP version with equivalent sample size set to one. For a complete
data set (that is, $Y^i = X$ for all $i$), we have a concave sum of logarithms
on $\theta$:

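$$S_\theta = \sum_{j=1}^{n} \sum_{\pi_j} \sum_{x_j} N'_{x_j,\pi_j} \log \theta_{x_j|\pi_j},$$

where $N'_{x_j,\pi_j} = N_{x_j,\pi_j} + \alpha_{x_j,\pi_j}$, and the estimate
$\hat\theta_{x_j|\pi_j} = N'_{x_j,\pi_j} / (\sum_{x_j} N'_{x_j,\pi_j})$
achieves maximum score. In the case of incomplete data, we have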

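$$S_\theta = \sum_{i=1}^{N} \log \sum_{z^i} \prod_{j=1}^{n} \theta_{x^i_j|\pi^i_j} + \alpha(\theta), \qquad (1)$$

where $x^i = (y^i, z^i) = (x^i_1, \ldots, x^i_n)$ represents an instantiation of
all the variables. No closed-form solution is known, and one has to directly
optimize $\max_\theta S_\theta$, subject to
$\forall j\, \forall \pi_j: \sum_{x_j} \theta_{x_j|\pi_j} = 1$ and
$\forall j\, \forall x_j\, \forall \pi_j: \theta_{x_j|\pi_j} \ge 0$.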

The most common approach to optimize this function is to use the EM
method, which completes the data with the expected counts for each missing
variable given the observed variables, that is, variables are completed by
"weights" $\theta^k_{z_j|y^i}$ for each $i, j$ of a missing value, where
$\theta^k$ represents the current estimate at iteration $k$. This idea is
equivalent to weighting the chance of having $Z_j = z_j$ by the (current)
distribution of $Z_j$ given $y^i$ (this is known as the E-step, and requires
inferences over the network instantiated with $\mathcal{P} = \theta^k$). Using
these weights together with the actual counts from the data, the sufficient
statistics values $N^k_{x_j,\pi_j}$ are computed for every $x_j, \pi_j$, and the
next (updated) estimate $\theta^{k+1}$ is obtained as if the data were complete:
$\theta^{k+1}_{x_j|\pi_j} = N'^k_{x_j,\pi_j} / (\sum_{x_j} N'^k_{x_j,\pi_j})$,
where $N'^k_{x_j,\pi_j} = N^k_{x_j,\pi_j} + \alpha_{x_j,\pi_j}$ as before (this
is the M-step). Because in the first step there is no current estimate
$\theta^0$, an initial guess has to be used. Using the score to test
convergence, this procedure achieves a saddle point, which is usually a local
optimum of the problem, and may vary according to the initial guess $\theta^0$.
In view of obtaining an estimate with high score, it is common to execute
multiple runs of EM with distinct initial guesses and then to take the estimate
with maximum score among them. However, it is often the case that many distinct
estimates have very similar scores, and simply selecting the highest-scoring one
is clearly an over-simplified decision, because equal (or almost equal) scores
cannot be used as a criterion to find the best estimate.
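
The E-step, M-step and multi-start loop described above can be sketched on a toy network. The following is a minimal illustration under stated assumptions, not the implementation used in the experiments: the network is a hypothetical two-node BN A -> B in which B is always observed and A may be missing (encoded as `None`); the pseudo-counts `ess / n_a` and `ess / (n_a * n_b)` are an assumed choice for the penalty with equivalent sample size one; all function names are illustrative.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def score(data, th_a, th_b, alpha_a, alpha_b):
    """Penalized log-likelihood of records (a, b); a is None when missing."""
    s = 0.0
    for a, b in data:
        if a is None:  # marginalize the missing value of A
            s += math.log(sum(th_a[k] * th_b[k][b] for k in range(len(th_a))))
        else:
            s += math.log(th_a[a] * th_b[a][b])
    s += sum(alpha_a * math.log(p) for p in th_a)
    s += sum(alpha_b * math.log(p) for row in th_b for p in row)
    return s

def em_run(data, n_a, n_b, rng, ess=1.0, tol=1e-9, max_iter=1000):
    """One EM run from a random initial guess; returns (th_a, th_b, score)."""
    alpha_a = ess / n_a            # assumed pseudo-count for p(A)
    alpha_b = ess / (n_a * n_b)    # assumed pseudo-count for p(B | A)
    th_a = normalize([rng.random() + 0.1 for _ in range(n_a)])
    th_b = [normalize([rng.random() + 0.1 for _ in range(n_b)])
            for _ in range(n_a)]
    current = score(data, th_a, th_b, alpha_a, alpha_b)
    for _ in range(max_iter):
        # E-step: expected counts, starting from the pseudo-counts.
        cnt_a = [alpha_a] * n_a
        cnt_b = [[alpha_b] * n_b for _ in range(n_a)]
        for a, b in data:
            if a is None:  # weight each completion by p(A = k | b)
                w = normalize([th_a[k] * th_b[k][b] for k in range(n_a)])
            else:
                w = [1.0 if k == a else 0.0 for k in range(n_a)]
            for k in range(n_a):
                cnt_a[k] += w[k]
                cnt_b[k][b] += w[k]
        # M-step: re-estimate as if the completed data were fully observed.
        th_a = normalize(cnt_a)
        th_b = [normalize(row) for row in cnt_b]
        previous, current = current, score(data, th_a, th_b, alpha_a, alpha_b)
        if current - previous < tol:
            break
    return th_a, th_b, current

def multi_start_em(data, n_a, n_b, runs=30, seed=0):
    """Run EM from several random initial guesses and return all results."""
    rng = random.Random(seed)
    return [em_run(data, n_a, n_b, rng) for _ in range(runs)]
```

Selecting `max(results, key=lambda r: r[2])` from the returned list corresponds to the traditional maximum score choice; keeping the whole list is what allows the alternative criteria discussed next to be applied.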

2.1 Entropy 

High score is not the only target in order to obtain a good estimate. There are 
many estimates that lie within a tiny distance from the global maximum and 
can be as good as or better than the global one. A simple alternative approach 
is to pick the parameter estimate which has maximum entropy, among those 
which have a high score [6]. Therefore, a possible criterion is

$$\hat\theta = \operatorname{argmax}_\theta \; -\sum_{j=1}^{n} \sum_{\pi_j} \sum_{x_j} \theta_{x_j|\pi_j} \log \theta_{x_j|\pi_j}, \qquad (2)$$

subject to $S_\theta \ge c \cdot s^*$, for a given $0 < c < 1$, where $s^*$ is
the maximum score. The optimization of Eq. (2) computes the maximum entropy
distribution with the guarantee that its score is high (within a small
difference from the maximum value). Eq. (2) is referred to as local entropy in
[10].

The maximum entropy is used because it leads to the most conservative
(least informative) distribution over the set of all estimates that achieve a
score at least as good as $c \cdot s^*$. The constraint that bounds the score
is not as simple as it reads: to use a non-linear optimization suite, it is
necessary to include all the equations that define the score function in order
to force it to be greater than a certain value. On the other hand, a simple
implementation of the entropy idea can be obtained from the many runs of the EM
method. The idea is to select, among the estimates returned by the different EM
runs that achieve a high enough score (compared to the maximum obtained), the
estimate with maximum entropy. This differs from the usual maximum entropy
inference in that it first checks for high score estimates and then maximizes
entropy among them. Given the usually large number of parameters to estimate
in a Bayesian network, restricting ourselves only to those estimates that have
exactly equal score is undesirable: usually only the top scoring estimate will
be left.
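
The selection among EM runs can be sketched as follows. This is a minimal illustration under assumptions: each run is a hypothetical pair `(cpts, score)`, where `cpts` is a list of conditional probability tables and each table is a list of rows (one distribution per parent configuration). Because log-scores are typically negative, the band of admissible scores is taken here as `best - (1 - c) * |best|`, which is one possible reading of the constraint and is labeled as such in the code.

```python
import math

def local_entropy(cpts):
    """Sum of the entropies of all conditional distributions (objective of Eq. 2)."""
    return -sum(p * math.log(p)
                for table in cpts
                for row in table
                for p in row
                if p > 0.0)

def select_max_entropy(runs, c=0.95):
    """Return the most entropic (cpts, score) pair among the high-score runs."""
    best = max(s for _, s in runs)
    # Assumed reading of the score constraint: stay within a fraction (1 - c)
    # of the best score, which also works when scores are negative.
    threshold = best - (1.0 - c) * abs(best)
    admissible = [(cpts, s) for cpts, s in runs if s >= threshold]
    return max(admissible, key=lambda run: local_entropy(run[0]))
```

The run with the highest score is always admissible, so the function never returns an empty selection; with `c = 1` it reduces to choosing among runs that tie exactly with the best score.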

A related maximum entropy approach is shown in [12], with two main
differences: (i) they focus on a continuous scenario with a compact
parametrization, e.g. using mean/variance or other similar features of the
data, which implies more estimates that fit the data equally well (our setting
has categorical data and there are tens to hundreds of parameters to estimate);
(ii) they force their estimator to have likelihood precisely equal to the value
of the maximum score (which in our case would be something similar to using
$c = 1$). We note that
they have also extended their idea to a so-called regularized version, which in- 
cludes a penalty in the entropy function. This is shown to become similar to 
the MAP estimation. We work differently by allowing some variation in the 
score without the use of an extra penalization for that purpose, as we consider 
all estimates with high score as equally good (they are later discriminated by 
their entropy). In a BN with more than a couple of variables, the number of 
parameters to estimate becomes quickly large and there is only a very small (or 
no) region of the parameter space with estimates that achieve the very same 
global maximum value. However, a feasibility region defined by a small per- 
centage away from the maximum score is enough to produce a whole region of 
estimates, indicating that the region of high score estimates is almost (but not 
exactly) flat. This is expected in a high-dimensional parameter space. 
2.2 Bayesian Model Averaging

A BMA-based approach can also be used to overcome the model uncertainty and
overfitting problems. BMA is applied to the alternative estimates returned
by the various EM runs in order to obtain a single final estimate. The rationale
is as follows: if we consider as query the variable $X_j$ given its parents, the
posterior probability distribution $p(X_j|\pi_j)$ returned by an inference
corresponds to the distribution specified in the conditional probability table
of the BN for $X_j$. To answer this query using BMA, we use the estimates
identified by each run of EM: we average the inferences returned by the models
identified by each EM run, using weights proportional to the score achieved by
them.¹ We repeat this query for each $j$ and each configuration $\pi_j$;
eventually we instantiate a single BN by setting
$\hat\theta_{x_j|\pi_j} = \hat{p}(x_j|\pi_j)$, where $\hat{p}(x_j|\pi_j)$ is the
BMA-averaged inference. In practice, this can be done by simply averaging the
coefficients of the conditional probability tables. Thus, BMA is locally
applied to estimate each conditional probability distribution; we refer to the
estimate obtained in this way as the BMA estimate. Note that the inferences
returned by the BMA estimate match those produced by the standard BMA (which
always computes the answer by querying over all the estimated models and then
averaging the returned values) only for queries on $X_j$ given its parents, and
not for more general queries. However, it is generally not possible to obtain a
summary estimate which exactly matches the inferences produced by the standard
usage of BMA. In fact, the standard BMA can be seen as an ensemble of BNs; to
get a single estimate, one has to average the joint distributions of these BNs.
This would produce a representation of the joint probability distribution that
is not guaranteed to factorize according to the original BN structure. Moreover,
it would be very demanding from the computational viewpoint. An exception exists
for naive structures [3], while our BMA-based approach can be used to estimate
the parameters of any BN.
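
The local averaging of conditional probability tables can be sketched as below. This is an illustration under assumptions rather than the exact weighting used in the experiments: each run is a hypothetical pair `(cpts, score)` with `cpts` organized as a list of tables, each table a list of rows; since the score is a logarithmic quantity, the weights are assumed here to be proportional to `exp(score)`, normalized over the runs.

```python
import math

def bma_average(runs):
    """Average the CPT entries of several EM estimates into a single estimate.

    runs: list of (cpts, score); all cpts must share the same shape.
    """
    top = max(s for _, s in runs)
    # Assumed weighting: proportional to exp(score); subtracting the top
    # score before exponentiating avoids numerical underflow.
    weights = [math.exp(s - top) for _, s in runs]
    total = sum(weights)
    weights = [w / total for w in weights]
    shape = runs[0][0]
    return [[[sum(w * cpts[t][r][k] for w, (cpts, _) in zip(weights, runs))
              for k in range(len(shape[t][r]))]
             for r in range(len(shape[t]))]
            for t in range(len(shape))]
```

Each averaged row is a convex combination of probability distributions, so it still sums to one and the result is a valid parametrization of the same BN structure.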

3 Experiments with EM as Underlying Engine 

In order to compare entropy and BMA approaches, we perform experiments 
using different network structures (Asia, Alarm and randomly generated net- 
works), sample sizes (n=100, n=200) and percentages of missing data mp 
(mp=30%, mp=60%). We also include the maximum score approach in the 
experiments to serve as baseline, which we refer to as MAP (this is similar to a 
penalized LL, as defined in Sec. 2). A triple (network structure, n, mp) identifies
a setting. For each setting, we perform 300 experiments, where each experiment 
is organized as follows: a) random draw of the parameters of the reference
network; b) sampling of n complete instances from the reference network; c)
application of an MCAR missingness process, which turns each value of the
instances into missing with probability mp;² d) execution of 30 runs of EM from
different initializations, using the MAP-based score; e) choice of the estimate
using MAP, entropy and BMA. To evaluate the quality of the estimates, we measure
the KL-divergence between the joint distribution represented by the reference
network and the estimated networks (denoted as joint metric). To measure the
quality of the inferences produced by the models obtained with different
criteria, it is necessary to select queries of interest. Having seen in
experiments that both BMA and entropy yield consistently better estimates than
MAP, we have chosen a query that could mitigate the differences among methods,
in order to conservatively assess the advantage (in terms of inferences) of
both BMA and entropy over MAP. In particular, we query the marginal joint
distribution of all leaf nodes, without any evidence set in the network (leaf
metric). This requires marginalizing out all non-leaf variables, so it involves
all variables in the computation. Because of that, local "mistakes" in
estimates can compensate each other, making it harder to assess differences
among methods. This is a desired characteristic if one wants to understand how
strong the difference among the ideas is. Since KL-divergences are not normally
distributed, we analyze the results through the non-parametric Friedman test
with significance level of 1% (which is reasonably strong). To prevent issues
from multiple comparisons, we performed the post-hoc analysis of the test via
Tukey's Honestly Significant Difference. Hence, the analysis yields a rank of
methods for each setting and each metric. As for entropy, we choose the maximum
entropy estimate among those whose score was at least as high as 95% of the
highest score.

¹ Strictly speaking, BMA requires averaging with weights given by the posterior
probability of each model, obtained as a product of its prior probability and
its marginal likelihood, which implies averaging over the parameters of the
models.
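
The joint metric can be sketched for small networks by brute-force enumeration. This is an illustrative sketch under assumptions: a BN is represented as a hypothetical list of `(parents, table)` pairs in topological order, where `table` maps a tuple of parent values to a distribution over the node's categories; the divergence is taken as KL(reference || estimate), and the estimate is assumed strictly positive wherever the reference is (which holds for estimates obtained with positive pseudo-counts). Enumeration is exponential in the number of variables, so this is only practical for small networks.

```python
import itertools
import math

def joint_prob(bn, x):
    """Probability of the full instantiation x under the factorization of bn."""
    p = 1.0
    for j, (parents, table) in enumerate(bn):
        parent_values = tuple(x[i] for i in parents)
        p *= table[parent_values][x[j]]
    return p

def kl_joint(reference, estimate, cards):
    """KL divergence between the joint distributions of two BNs."""
    kl = 0.0
    for x in itertools.product(*(range(c) for c in cards)):
        p = joint_prob(reference, x)
        if p > 0.0:
            kl += p * math.log(p / joint_prob(estimate, x))
    return kl
```

For example, with a reference network A -> B over binary variables, `kl_joint(reference, reference, [2, 2])` is zero and any different estimate yields a positive value.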

3.1 ASIA Network 

This set of experiments uses the structure of the Asia network [9]. In all
settings (shown in Table 1) and for the two metrics (joint and leaf), the
Friedman test returned the rank: 1) BMA; 2) entropy; 3) MAP. To better
understand the quantitative difference among them, we report in Table 1 the
relative medians
of KL divergence, namely the medians of BMA and entropy in a certain task, 
divided by the median obtained by MAP in the same task. This allows us to 
see the quantitative improvement of them over MAP. 

The improvement of the median over MAP ranges, depending on the task,
from 2% to 29% for BMA and from 1% to 14% for entropy. Moreover, the
improvement is consistent, occurring in all settings. Interestingly, the
difference in performance increases when the learning task is more challenging.
For instance, the advantage of BMA over entropy, and of both BMA and entropy
over MAP, increases with the percentage of missing data. Conversely, the greater
the sample size the easier the learning task; thus the performance of the
methods is more similar for n=200 than for n=100, even though the differences
remain statistically significant in all the cases. As a result of the
conservative design of the query, the differences among methods are generally
less apparent in the leaf

² MCAR (missing completely at random) indicates that the probability of each
value being missing depends neither on the value itself nor on the values of
other variables.
Metric   n=100, mp=30%    n=100, mp=60%    n=200, mp=30%    n=200, mp=60%
         BMA     entr.    BMA     entr.    BMA     entr.    BMA     entr.
joint    0.90    0.96     0.79    0.90     0.92    0.96     0.81    0.91
leaf     0.93    0.92     0.87    0.86     0.98    0.99     0.92    0.89

Table 1: Relative medians of KL divergence with the ASIA network, i.e., medi- 
ans of BMA and entropy divided by the median of MAP. Smaller numbers indi- 
cate better performance; in particular, values smaller than 1 indicate a smaller 
median than MAP. Each cell corresponds to 300 experiments. 

metric than in the joint metric. Overall, these experiments indicate BMA as the
best option, while both BMA and entropy provide significantly better
performance than MAP both in the parameter estimates and in the inferences. An
insight into why entropy outperforms MAP is given by Figure 1, which
clearly shows that a higher MAP score does not necessarily imply a better
estimate; instead, when dealing with estimates of high MAP score, entropy is
more discriminative than the MAP score and also has a stronger correlation
with the KL divergence.




Figure 1: Relation between KL divergence, entropy and score; darker points 
represent lower KL divergence between true and estimated joint distributions. 
The figure refers to one thousand EM runs performed on an incomplete training 
set of 200 samples. 



3.2 ALARM Network 

The ALARM network has 37 nodes and 8 leaves [1]. Again, we consider mp of
30% and 60%, and sample size n of 100 and 200. As for the joint metric, in all
settings the rank was: 1) BMA; 2) entropy; 3) MAP. As for the leaf metric, in
all settings but one the rank was: 1) BMA & entropy; 2) MAP. The relative
medians of KL divergence, reported in Table 2, show larger differences on the
joint metric than on the leaf one; only a slight difference exists between BMA
and entropy in the latter case, while the advantage of both ideas over MAP is
clearer. On the joint metric, BMA is by far the best-performing approach. Also
in this case, the difference among ideas is emphasized when mp increases or the
sample size decreases. As before, BMA achieves the best results: it provides the
best parameter estimates and it is at least as good as entropy for inferences.
BMA and entropy consistently outperform MAP in the quality of parameter
estimates and inferences.



Metric   n=100, mp=30%    n=100, mp=60%    n=200, mp=30%    n=200, mp=60%
         BMA     entr.    BMA     entr.    BMA     entr.    BMA     entr.
joint    0.85    0.93     0.79    0.88     0.89    0.93     0.82    0.89
leaf     0.96    0.95     0.94    0.94     0.98    0.97     0.97    0.96

Table 2: Relative medians of KL divergence for experiments with the ALARM
network. Each cell corresponds to 300 experiments.



3.3 Randomly generated networks 

In the case of randomly generated structures, the experimental procedure
described in Section 3 also includes the generation of the random structure,
which is accomplished before drawing the parameters. Given two variables $X_i$
and $X_j$, an arc from $X_i$ to $X_j$ is randomly included with probability 1/3
if $i < j$ (no arc is included if $i > j$, which ensures that the graph is
acyclic and has no loops). Furthermore, the maximum number of parents is set to
4 and the number of categories per variable ranges from 2 to 4 (randomly chosen
too). After the graph is generated, the experiments follow as before (see Table
3). On the joint metric, we always obtain the rank 1) BMA, 2) entropy, 3) MAP.
On the leaf metric, we obtain that same rank in two settings and the rank 1)
BMA & entropy; 2) MAP in the other two settings. Thus, there is a consistent
superiority of BMA over entropy in estimating parameters, although this does
not always imply a superiority on the leaf queries. Both BMA and entropy
are superior to MAP in all the cases.
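
The structure generation can be sketched as follows. This is a minimal illustration under assumptions: nodes are indexed from 0, arcs only go from a lower to a higher index, and the cap on the number of parents is enforced by simply not adding further arcs once a node has reached it (the text does not specify how the cap is applied, so this rule is hypothetical, as are the function and parameter names).

```python
import random

def random_structure(n_nodes, rng, arc_prob=1.0 / 3.0, max_parents=4):
    """Draw a random DAG and the number of categories of each variable.

    Returns (parents, cards): parents[j] lists the parents of node j,
    cards[j] is the number of categories of node j.
    """
    parents = {j: [] for j in range(n_nodes)}
    for j in range(n_nodes):
        for i in range(j):  # only arcs i -> j with i < j, hence no cycles
            # Assumed rule: stop adding arcs once the parent cap is reached.
            if len(parents[j]) < max_parents and rng.random() < arc_prob:
                parents[j].append(i)
    cards = [rng.randint(2, 4) for _ in range(n_nodes)]
    return parents, cards
```

A call such as `random_structure(20, random.Random(0))` yields a 20-node structure of the kind used in this section; the natural node order is already a topological order, which simplifies sampling from the network.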

In summary, these experiments indicate BMA as the best choice: it consis- 
tently yields the best parameter estimates, and on the leaf queries it is either 
the best idea or it ties with entropy. Overall, BMA and entropy are consistently 
better than MAP, both on the joint and on the leaf metric. 
Metric   n=100, mp=30%    n=100, mp=60%    n=200, mp=30%    n=200, mp=60%
         BMA     entr.    BMA     entr.    BMA     entr.    BMA     entr.
joint    0.78    0.94     0.75    0.89     0.82    0.92     0.80    0.89
leaf     0.89    0.92     0.88    0.88     0.96    0.97     0.95    0.92



Table 3: Relative medians for experiments with randomly generated networks 
with 20 nodes. 

4 Experiments using Continuous Optimization 

The maximum entropy criterion, applied to the selection of the parameter
estimates, maximizes entropy while guaranteeing that the score exceeds a
certain threshold. So far, we have applied this idea on top of the multi-start
EM, which selects only among estimates returned by the different EM runs. This
approach identifies parameter estimates which are better than those from MAP,
but usually worse than BMA's estimates. An alternative way to implement the
idea of maximum entropy is to directly solve the non-linear optimization
problem. This requires maximizing Eq. (2) subject to the constraint that the
score is only marginally smaller than the best score, allowing a more
fine-grained way to select the estimate than looking at the solutions
identified by the different EM runs. This approach is referred to in the
following as C-entropy. The reason to analyze this situation is that EM runs
tend to return local optima of the score function. However, the maximum entropy
estimate might well be a non-optimal estimate in terms of score. Because of
that, even if we increase the number of EM runs and use many more
initializations (a situation which we have tested), the entropy idea is still
confined to saddle points of the score function, which is only an approximation
of a true maximum entropy idea. Therefore, we perform experiments with two
specific network structures with the aim of understanding whether the entropy
idea improves by using a continuous optimization method instead of EM. For ease
of exposition, we focus only on the joint metric, as this is the metric in
which entropy is consistently inferior to BMA.

4.1 Experiments with BN1

Figure 2 shows the structure of the network used in the first set of
experiments. Variables A (binary) and B (ternary) have uniform distributions
and are always observed; variables U, E and T are binary (assuming states true
and false); the value of T is defined by the logical relation T = E ∧ U.
Variable T is always observed, while U and E are affected by the missingness
process. Both U and E are observed if and only if T is true. Therefore, E and U
are either both observed and positive, or non-observed. The missingness process
is MAR,³ because given T (always observed) the probability of U and E being
missing does not depend on their values. Under the chosen conditions, E and U
were missing in about 85% of the sampled instances. We assume the conditional
probabilities of T to be known, thus focusing on the difficulty of learning the
probabilities of nodes U and E.

³ More precisely, this missingness process is MAR (as required by EM) but not
MCAR (missing completely at random); for a discussion of the different kinds of
missingness, see for instance [7, Sec. 19.1.2].

Figure 2: Network BN1; nodes affected by the missingness process have a grey
background.

For this network, we also developed a solver which identifies the global
maximum of the MAP score; technically, the solver maps the learning problem
into a polynomial programming one; more details are given in the supplementary
material. We use the global solver in two different ways: (i) to compute
C-entropy, thus ensuring that the estimate has a MAP score close to the global
maximum (we allowed a tolerance of 1%); (ii) to compute the global MAP estimate
on its own, referred to as global MAP. It is interesting to compare global MAP
with MAP in order to assess the impact of the local solver (that is, EM) on the
quality of estimates. To the best of our knowledge, there is no previous
analysis with global exact solvers for learning BNs from incomplete samples.

The same experimental procedure of Section 3 is used, considering sample
sizes in {100, 200} and performing 300 experiments for each setting. For both
sample sizes, the Friedman test on the joint metric returns the following rank:
1) BMA; 2) C-entropy; 3) entropy; 4) MAP; 5) global MAP. The relative medians
were, respectively, for n=100: 0.27, 0.33, 0.37, 1, 1.5; for n=200: 0.14, 0.25,
0.35, 1, 1.2. These results can be commented on from several viewpoints. First,
C-entropy significantly improves over entropy, almost reaching the same quality
as BMA. Second, estimates found by the global solver were worse than those of
MAP; in fact, the penalized MAP function offers only partial protection against
overfitting: it is indeed less prone to overfitting than log-likelihood, but an
estimate which maximizes the MAP score can still be affected by overfitting.
The value of the MAP score achieved by the global solver was only slightly
higher (around 2.5%) than that achieved by the local maximum identified by the
multi-start EM. Third, BMA and the two entropy implementations outperform MAP,
confirming the results of the previous experiments.

4.2 Experiments with BN3

Network BN3 has structure A → B → C. We considered two different
configurations of number of states for each node: 5-3-5 (meaning A, C with 5
categories and B with 3) and 8-4-8 (A, C with 8 categories and B with 4). In
both cases, we made B randomly missing in 85% of the instances. Thus, despite
the simple structure of the network, the learning task is interesting since
there are many missing data and a moderately high number of states. The two
configurations require estimating from incomplete samples respectively
2 · 5 + 4 · 3 = 22 and 8 · 3 + 7 · 4 = 52 parameters; to these numbers, one
should add the marginals of A, which are however learned from complete samples
and whose estimate is thus identical for all methods. We fixed n to 300 for the
5-3-5 configuration and to 500 for the 8-4-8. In this case, the performance of
C-entropy was extremely good. We obtained, for both settings, the following
rank in the Friedman test: 1) C-entropy; 2) BMA; 3) entropy; 4) MAP. The global
MAP was not run, because the number of free unknowns in the continuous
optimization problem was too high to achieve the global solution in reasonable
time. The relative medians for the joint metric are (numbers are given in the
order: C-entropy, BMA, entropy and MAP): for the 5-3-5 configuration: 0.71,
0.78, 0.90, 1; for the 8-4-8 configuration: 0.68, 0.78, 0.92, 1. These
experiments confirm that, in order to get the most out of the entropy
criterion, it is much more effective to have a dedicated solver than to apply
it on top of the multi-start EM. Here, C-entropy is even better than BMA, so
its implementation for general settings and further analyses are planned for
the near future.
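
The parameter counts quoted above for the two configurations can be checked with a short computation. This is a sketch with illustrative names: each conditional probability table of a node contributes (number of categories minus one) free parameters per parent configuration, and only the tables of B and C are counted because A is learned from complete samples.

```python
def free_parameters(cards, parents):
    """Free parameters per node: (|X_j| - 1) times the number of parent configurations."""
    counts = {}
    for node, node_parents in parents.items():
        configurations = 1
        for parent in node_parents:
            configurations *= cards[parent]
        counts[node] = (cards[node] - 1) * configurations
    return counts

# Chain A -> B -> C, as in network BN3.
parents = {"A": [], "B": ["A"], "C": ["B"]}
small = free_parameters({"A": 5, "B": 3, "C": 5}, parents)
large = free_parameters({"A": 8, "B": 4, "C": 8}, parents)
incomplete_small = small["B"] + small["C"]   # 2*5 + 4*3
incomplete_large = large["B"] + large["C"]   # 3*8 + 7*4
```

The two sums reproduce the figures 22 and 52 given in the text.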

5 Conclusions 

This paper suggests that maximizing (penalized) likelihood or MAP scores is 
not the best choice to learn the parameters of a Bayesian network. In particular,
a high score is necessary but not sufficient in order to have a good estimate. 
To improve estimation, we propose: (i) a BMA approach that averages over 
the estimates learned in different runs of EM, and (ii) a maximum entropy 
criterion applied over the estimates with high scores. The entropy idea can 
be implemented on top of EM or using a dedicated non-linear solver, which 
allows a more fine-grained choice of estimates. Both BMA and entropy can 
be promptly integrated into any EM implementation at virtually no cost in 
terms of implementation and running time; instead, the non-linear solver for 
entropy requires some additional implementation effort. Thorough experiments 
show that the presented ideas significantly improve the quality of estimates 
when compared to standard maximum penalized likelihood and MAP ideas. 
If EM is used as optimization engine, then BMA yields by far the best results,
followed by entropy and MAP. If the dedicated non-linear solver is used, entropy
performs as well as BMA, or even better. Moreover, for a specific network, we
developed a global solver for the MAP estimation. We showed that its score is
only slightly higher than the maximum identified by EM runs, and yet it yields
worse estimates. This corroborates the other results, indicating that the
usual scores do suffer from overfitting and/or model uncertainty.

References 

[1] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F. Cooper. The
ALARM monitoring system: A case study with two probabilistic inference
techniques for belief networks. In Second European Conference on Artificial
Intelligence in Medicine, volume 38, pages 247-256, 1989.

[2] J. Cerquides and Ramon Mantaras. Robust bayesian linear classifier ensem- 
bles. In Proc. European Conference on Machine Learning (ECML) 2005, 
volume 3720 of Lecture Notes in Computer Science, pages 72-83. Springer, 
2005. 

[3] D. Dash and G.F. Cooper. Exact Model Averaging with Naive Bayesian 
Classifiers. Proc. of the 19th International Conference on Machine Learn- 
ing, pages 91-98, 2002. 

[4] A. P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical
Society. Series B (Methodological), 39(1):1-38, 1977.

[5] J. A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. Bayesian 
model averaging: a tutorial. Statistical science, 14(4):382-417, 1999. 

[6] E. T. Jaynes. On the rationale of maximum-entropy methods. Proc. IEEE, 
70(9):939-952, 1982. 

[7] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press,
2009. 

[8] S.L. Lauritzen. The EM algorithm for graphical association models with
missing data. Computational Statistics & Data Analysis, 19(2):191-201, 
1995. 

[9] S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabili- 
ties on graphical structures and their application to expert systems. Journal 
of the Royal Statistical Society. Series B (Methodological), 50(2):157-224, 
1988. 

[10] T. Lukasiewicz. Credal Networks under Maximum Entropy. In Proceedings 
of the 16th Conference on Uncertainty in Artificial Intelligence, pages 363- 
370. Morgan Kaufmann Publishers Inc., 2000. 

[11] G. M. McLachlan and T. Krishnan. The EM Algorithm and Extensions.
Wiley, New York, 1997.

[12] S. Wang, D. Schuurmans, F. Peng, and Y. Zhao. Combining statistical
language models via the latent maximum entropy principle. Machine Learning,
60(1-3):229-250, 2005.






